SwePub

Results list for search "db:Swepub ;pers:(Lu Zhonghai);pers:(Yao Yuan)"

  • Results 1-10 of 13
1.
  • Lu, Zhonghai, et al. (author)
  • Aggregate flow-based performance fairness in CMPs
  • 2016
  • In: ACM Transactions on Architecture and Code Optimization (TACO). - : Association for Computing Machinery (ACM). - 1544-3566 .- 1544-3973. ; 13:4
  • Journal article (peer-reviewed)
    • In CMPs, multiple co-executing applications create mutual interference when sharing the underlying network-on-chip architecture. Such interference causes different performance slowdowns to different applications. To mitigate the unfairness problem, we treat traffic initiated from the same thread as an aggregate flow such that causal request/reply packet sequences can be allocated to resources consistently and fairly according to online profiled traffic injection rates. Our solution comprises three coherent mechanisms from rate profiling, rate inheritance, and rate-proportional channel scheduling to facilitate and realize unbiased workload-adaptive resource allocation. Full-system evaluations in GEM5 demonstrate that, compared to classic packet-centric and latest application-prioritization approaches, our approach significantly improves weighted speed-up for all multi-application mixtures and achieves nearly ideal performance fairness.
2.
  • Lu, Zhonghai, et al. (author)
  • Dynamic Traffic Regulation in NoC-Based Systems
  • 2017
  • In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems. - : IEEE Press. - 1063-8210 .- 1557-9999. ; 25:2, pp. 556-569
  • Journal article (peer-reviewed)
    • In network-on-chip (NoC)-based systems, performance enhancement has primarily focused on the network itself, with little attention paid to controlling traffic injection at the network boundary. This is unsatisfactory because traffic may be over injected, aggravating congestion, and lowering performance. Recently, traffic regulation has been proposed as an orthogonal means for performance improvement. Rather than as-soon-as-possible admission, traffic regulation may hold back packet injection by admitting packets into the network only when the accumulated traffic volume at any time interval does not exceed a threshold. These regulation techniques are, however, often static, likely causing overregulation and underregulation. We propose dynamic traffic regulation to improve the system performance for NoC-based multi/many-processor systems-on-chip (MPSoC) and chip multi/many-core processor (CMP) designs. It can be applied to MPSoCs for intellectual property integration in an open-loop fashion by injecting traffic according to its run-time profiled characteristics. It can also be applied to CMPs in a closed-loop fashion by admitting traffic fully adaptive to the traffic and network states. Through extensive experiments and results, we show that both the open-loop and closed-loop dynamic regulation techniques can significantly improve the network and system performance.
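The static regulation that the abstract above takes as its starting point (admit a packet only if the accumulated traffic over any interval stays under a threshold) is commonly realized as a token bucket. The following is a minimal sketch under that assumption, not the dynamic scheme proposed in the paper; the class name, the `burst` and `rate` parameters, and the flit-based accounting are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenBucketRegulator:
    """Admit traffic so any interval of length t carries at most burst + rate * t flits."""
    burst: float   # largest burst admitted at once (assumed parameter)
    rate: float    # sustained flits per cycle (assumed parameter)
    tokens: float = 0.0
    last_cycle: int = 0

    def __post_init__(self) -> None:
        self.tokens = self.burst  # start with a full bucket

    def admit(self, cycle: int, flits: int) -> bool:
        """Return True if a packet of `flits` flits may be injected at `cycle`."""
        # Refill for the elapsed cycles, capped at the burst size.
        elapsed = cycle - self.last_cycle
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_cycle = cycle
        if flits <= self.tokens:
            self.tokens -= flits
            return True
        return False  # hold the packet back at the network boundary


regulator = TokenBucketRegulator(burst=4, rate=0.5)
decisions = [regulator.admit(cycle, 2) for cycle in (0, 0, 0, 2, 10)]
print(decisions)  # [True, True, False, False, True]
```

A dynamic regulator in the spirit of the abstract would adjust `burst` and `rate` at run time from profiled traffic or network state; how it does so is not specified here.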
3.
4.
  • Lu, Zhonghai, et al. (author)
  • Thread Voting DVFS for Manycore NoCs
  • 2018
  • In: IEEE Transactions on Computers. - : IEEE Computer Society. - 0018-9340 .- 1557-9956. ; 67:10, pp. 1506-1524
  • Journal article (peer-reviewed)
    • We present a thread-voting DVFS technique for manycore networks-on-chip (NoCs). This technique has two remarkable features which differentiate from conventional NoC DVFS schemes. (1) Not only network-level but also thread-level runtime performance indicatives are used to guide DVFS decisions. (2) To resolve multiple perhaps conflicting performance indicatives from many cores, it allows each thread to 'vote' for a V/F level in its own performance interest, and a region-based V/F controller makes dynamic per-region V/F decision according to the major vote. We evaluate our technique on a 64-core CMP in full-system simulation environment GEM5 with both PARSEC and SPEC OMP2012 benchmarks. Compared to a network metric (router buffer occupancy) based approach, it can improve the network energy efficacy measured in MPPJ (million packets per joule) by up to 22 percent for PARSEC and 20 percent for SPEC OMP2012, and the system energy efficacy measured in MIPJ (million instructions per joule) by up to 35 percent for PARSEC and 33 percent for SPEC OMP2012. 
5.
  • Lu, Zhonghai, et al. (author)
  • Towards stochastic delay bound analysis for network-on-chip
  • 2015
  • In: Proceedings - 2014 8th IEEE/ACM International Symposium on Networks-on-Chip, NoCS 2014. - 9781479953479 ; , pp. 64-71
  • Conference paper (peer-reviewed)
    • We propose stochastic performance analysis in order to provide probabilistic quality-of-service guarantees in on-chip packet-switching networks. In contrast to deterministic analysis, which gives a per-flow absolute delay bound, stochastic analysis derives a per-flow probabilistic delay bounding function, which can be used to avoid over-dimensioning network resources. Based on stochastic network calculus, we build a basic analytic model for an on-chip router, and propose and exemplify a stochastic performance analysis flow. In experiments, we show the correctness and accuracy of our analysis, and exhibit its potential in enhancing network utilization with a relaxed delay requirement. Moreover, the benefits of such relaxation are demonstrated through a video playback application.
6.
  • Yao, Yuan, et al. (author)
  • DVFS for NoCs in CMPs : A thread voting approach
  • 2016
  • In: 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). - : Institute of Electrical and Electronics Engineers (IEEE). - 9781467392112 ; , pp. 309-320
  • Conference paper (peer-reviewed)
    • As the core count grows rapidly, dynamic voltage/frequency scaling (DVFS) in networks-on-chip (NoCs) becomes critical in optimizing energy efficacy in chip multiprocessors (CMPs). Previously proposed techniques often exploit inherent network-level metrics to do so. However, such network metrics may contradictorily reflect application's performance need, leading to power over/under provisioning. We propose a novel on-chip DVFS technique for NoCs that is able to adjust per-region V/F level according to voted V/F levels of communicating threads. Each region is composed of a few adjacent routers sharing the same V/F level. With a voting-based approach, threads seek to influence the DVFS decisions independently by voting for a preferred V/F level that best suits their own performance interest according to their runtime profiled message generation rate and data sharing characteristics. The vote expressed in a few bits is then carried in the packet header and spread to the routers on the packet route. The final DVFS decision is made democratically by a region DVFS controller based on the majority election result of collected votes from all active threads. To achieve scalable V/F adjustment, each region works independently, and the voting-based V/F tuning forms a distributed decision making process. We evaluate our technique with detailed simulations of a 64-core CMP running a variety of multi-threaded PARSEC benchmarks. Compared with a network without DVFS and a network metric (router buffer occupancy) based approach, experimental results show that our voting based DVFS mechanism improves the network energy efficacy measured in MPPJ (million packets per joule) by about 17.9% and 9.7% on average, respectively, and the system energy efficacy measured in MIPJ (million instructions per joule) by about 26.3% and 17.1% on average, respectively.
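The per-region decision step described in the abstract above (each active thread votes for a V/F level, and the region controller follows the election result) might look like the sketch below. It is an illustration only: the function name, the integer level encoding, the fallback for a region with no votes, and the tie-break rule are assumptions, and the code picks the most-voted level (a plurality), which coincides with a strict majority only when one level collects more than half of the votes.

```python
from collections import Counter
from typing import Iterable


def region_vf_level(votes: Iterable[int], default_level: int = 0) -> int:
    """Pick the V/F level for one region from the votes of its active threads."""
    tally = Counter(votes)
    if not tally:
        return default_level  # no thread voted in this epoch (assumed fallback)
    top = max(tally.values())
    # Assumed tie-break: prefer the higher level so no tied voter is slowed down.
    return max(level for level, count in tally.items() if count == top)


# Votes decoded from packet headers in one region; levels 0..3 are hypothetical.
chosen = region_vf_level([2, 3, 2, 1, 2, 3])
tied = region_vf_level([1, 3, 3, 1])
idle = region_vf_level([])
print(chosen, tied, idle)  # 2 3 0
```

Because each region calls such a function on its own votes only, the decisions stay local, which matches the distributed decision making the abstract describes.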
7.
  • Yao, Yuan, et al. (author)
  • Fuzzy flow regulation for Network-on-Chip based chip multiprocessors systems
  • 2014
  • In: 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC). - : IEEE. - 9781479928163 ; , pp. 343-348
  • Conference paper (peer-reviewed)
    • Flow regulation is a traffic shaping technique, which can be used to improve communication performance with better utilization of network resources in chip multi-processors (CMPs). This paper presents fuzzy flow regulation. Being different from the static flow regulation policy, our system makes regulation decisions fully dynamically according to traffic dynamism and the state of interconnection network. The central idea is to use fuzzy logic to mimic the behavior of an expert that can recognize the network status and then intelligently control the admission of input flows. As the experiment results show, the maximum improvement in average delay reaches 53.0% against static regulation and 37.4% against no regulation. The maximum improvement in average throughput reaches 37.5% against static regulation and 23.8% against no regulation.
8.
  • Yao, Yuan, et al. (author)
  • INPG : Accelerating Critical Section Access with In-network Packet Generation for NoC Based Many-Cores
  • 2018
  • In: 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). - : IEEE Computer Society. - 9781538636596 ; , pp. 15-26
  • Conference paper (peer-reviewed)
    • As recently studied, serialized competition overhead for entering critical section is more dominant than critical section execution itself in limiting performance of multi-threaded shared variable applications on NoC-based many-cores. We illustrate that the invalidation-acknowledgement delay for cache coherency between the home node storing the critical section lock and the cores running competing threads is the leading factor to high competition overhead in lock spinning, which is realized in various spin-lock primitives (such as the ticket lock, ABQL, MCS lock, etc.) and the spinning phase of queue spin-lock (QSL) in advanced operating systems. To reduce such high lock coherence overhead, we propose in-network packet generation (iNPG) to turn passive 'normal' NoC routers which only transmit packets into active 'big' ones that can generate packets. Instead of performing all coherence maintenance at the home node, big routers which are deployed nearer to competing threads can generate packets to perform early invalidation-acknowledgement for failing threads before their requests reach the home node, shortening the protocol round-trip delay and thus significantly reducing competition overhead in various locking primitives. We evaluate iNPG in Gem5 using PARSEC and SPEC OMP2012 programs with five different locking primitives. Compared to a state-of-the-art technique accelerating critical section access, experimental results show that iNPG can effectively reduce lock coherence overhead, expediting critical section access by 1.35x on average and 2.03x at maximum and consequently improving the program Region-of-Interest (ROI) runtime by 7.8% on average and 14.7% at maximum.
9.
  • Yao, Yuan, et al. (author)
  • Memory-Access Aware DVFS for Network-on-Chip in CMPs
  • 2016
  • In: PROCEEDINGS OF THE 2016 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE). - Singapore : IEEE conference proceedings. - 9783981537079 ; , pp. 1433-1436
  • Conference paper (peer-reviewed)
    • We present a new DVFS technique for network-on-chip (NoC) that adjusts the voltage/frequency scales of routers according to memory-access characteristics of application running on the CMP. The memory characteristics are periodically profiled, reflecting both resource-access density in the network and memory-access criticality for application performance. The network conducts per-router voltage/frequency tuning using the memory-access density information while it performs priority-based switch allocation to speed up critical packets and avoid starvation using the memory-criticality information. Compared to a latest per-router DVFS approach, benchmark experiments demonstrate that our memory-access characteristics aware DVFS technique achieves not only better power saving, energy-delay product, but also enhanced network and application performance.
10.
  • Yao, Yuan, et al. (author)
  • Opportunistic Competition Overhead Reduction for Expediting Critical Section in NoC Based CMPs
  • 2016
  • In: Proceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016. - : IEEE. - 9781467389471 ; , pp. 279-290
  • Conference paper (peer-reviewed)
    • With the degree of parallelism increasing, performance of multi-threaded shared variable applications is not only limited by serialized critical section execution, but also by the serialized competition overhead for threads to get access to critical section. As the number of concurrent threads grows, such competition overhead may exceed the time spent in critical section itself, and become the dominating factor limiting the performance of parallel applications. In modern operating systems, queue spinlock, which comprises a low-overhead spinning phase and a high-overhead sleeping phase, is often used to lock critical sections. In the paper, we show that this advanced locking solution may create very high competition overhead for multithreaded applications executing in NoC-based CMPs. Then we propose a software-hardware cooperative mechanism that can opportunistically maximize the chance that a thread wins the critical section access in the low-overhead spinning phase, thereby reducing the competition overhead. At the OS primitives level, we monitor the remaining times of retry (RTR) in a thread's spinning phase, which reflects in how long the thread must enter into the high-overhead sleep mode. At the hardware level, we integrate the RTR information into the packets of locking requests, and let the NoC prioritize locking request packets according to the RTR information. The principle is that the smaller RTR a locking request packet carries, the higher priority it gets and thus quicker delivery. We evaluate our opportunistic competition overhead reduction technique with cycle-accurate full-system simulations in GEM5 using PARSEC (11 programs) and SPEC OMP2012 (14 programs) benchmarks. Compared to the original queue spinlock implementation, experimental results show that our method can effectively increase the opportunity of threads entering the critical section in low-overhead spinning phase, reducing the competition overhead averagely by 39.9% (maximally by 61.8%) and accelerating the execution of the Region-of-Interest averagely by 14.4% (maximally by 24.5%) across all 25 benchmark programs.
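The prioritization principle stated in the abstract above (the smaller the RTR carried by a locking-request packet, the higher its priority) can be illustrated with a priority queue. This is a minimal software sketch, not the router hardware from the paper; the class name, the packet identifiers, and the first-come-first-served tie-break are assumptions.

```python
import heapq
from itertools import count
from typing import List, Tuple


class RtrArbiter:
    """Serve locking-request packets in order of smallest remaining times of retry."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._arrival = count()  # tie-break: first come, first served (assumption)

    def enqueue(self, rtr: int, packet_id: str) -> None:
        heapq.heappush(self._heap, (rtr, next(self._arrival), packet_id))

    def grant(self) -> str:
        """Return the packet that wins arbitration this round."""
        _, _, packet_id = heapq.heappop(self._heap)
        return packet_id


arbiter = RtrArbiter()
for rtr, packet_id in [(12, "t0"), (3, "t1"), (7, "t2"), (3, "t3")]:
    arbiter.enqueue(rtr, packet_id)
order = [arbiter.grant() for _ in range(4)]
print(order)  # ['t1', 't3', 't2', 't0']
```

Threads closest to exhausting their retries, and therefore closest to the high-overhead sleeping phase, are served first.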
Publication type
conference paper (7)
journal article (5)
doctoral thesis (1)
Content type
peer-reviewed (12)
other academic/artistic (1)
Author/editor
Yao, Yuan, 1986- (2)
Jiang, Y. (1)
Carlson, Trevor E., ... (1)
University
Kungliga Tekniska Högskolan (13)
Language
English (13)
Research subject (UKÄ/SCB)
Engineering and Technology (12)
Natural Sciences (2)
